merge upstream/main: replace AP+NN cache with unified ContainerProfileCache (#788) #36
Merged
* feat: foundation for ContainerProfileCache unification (steps 1, 2, 5-early)
Additive-only scaffolding for the upcoming migration from the two
workload-keyed caches (applicationprofilecache + networkneighborhoodcache)
to a single container-keyed ContainerProfileCache. No consumers are
rewired yet; all new code is unused.
- Storage client: GetContainerProfile(namespace, name) on ProfileClient
interface + *Storage impl + mock.
- ContainerProfileCache interface + stub impl (methods return zero values;
filled in by step 3/4).
- Prometheus metrics: nodeagent_user_profile_legacy_loads_total{kind,completeness}
deprecation counter + reconciler SLO metrics (entries gauge, hit/miss
counter, tick duration histogram, eviction counter) registered up front
so later steps emit cleanly.
Plan artifacts in .omc/plans/; approved by ralplan Planner/Architect/Critic
consensus (v2, iteration 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: ContainerProfileCacheImpl + projection + shared-pointer fast-path (steps 3, 3.5, 4)
- CachedContainerProfile entry with Shared/RV/UserAP/UserNNRV fields
- Option A+ fast-path: shared storage pointer when no user overlay
- projection.go ports mergeContainers/mergeNetworkNeighbors from legacy caches
- partial-profile detection with dedup'd WARN log + completeness metric label
- Event-path delete with WithLock+ReleaseLock (Critic #2 lock-gap fix)
- Unit tests T4 (projection) + T6 (callstack parity) + fast-path identity
Step 5 (reconciler) and legacy deletion land in follow-ups.
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
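A rough sketch of the shared-pointer fast-path described above. The types and names here (Profile, cachedEntry, buildEntry) are illustrative stand-ins for CachedContainerProfile, not the real implementation:

```go
package main

import "fmt"

// Profile stands in for v1beta1.ContainerProfile.
type Profile struct {
	Name  string
	Execs []string
}

// cachedEntry mirrors the idea behind CachedContainerProfile: when Shared
// is true, Profile points directly at the storage-owned object (read-only
// by convention); otherwise it is a private, merged copy.
type cachedEntry struct {
	Profile *Profile
	Shared  bool
}

// buildEntry applies the fast-path: with no user overlay the storage
// pointer is reused as-is; with an overlay we must copy and merge.
func buildEntry(storage, overlay *Profile) cachedEntry {
	if overlay == nil {
		// Fast path: zero-copy, but callers must never mutate.
		return cachedEntry{Profile: storage, Shared: true}
	}
	merged := &Profile{Name: storage.Name}
	merged.Execs = append(append(merged.Execs, storage.Execs...), overlay.Execs...)
	return cachedEntry{Profile: merged, Shared: false}
}

func main() {
	base := &Profile{Name: "cp-1", Execs: []string{"/bin/sh"}}
	fast := buildEntry(base, nil)
	slow := buildEntry(base, &Profile{Execs: []string{"/usr/bin/curl"}})
	fmt.Println(fast.Shared, fast.Profile == base) // true true: pointer identity
	fmt.Println(slow.Shared, slow.Profile == base) // false false: private copy
}
```

The pointer-identity check in the last lines is what the fast-path identity unit test mentioned above asserts.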
* feat: ContainerProfileCache reconciler with evict + refresh (step 5)
- tickLoop drives evict + refresh on one goroutine, refresh gated by atomic
- reconcileOnce evicts entries whose pod is gone or container stopped
- refreshAllEntries snapshots IDs then refreshes outside Range to avoid a
SafeMap RLock/WLock deadlock (rebuildEntry calls Set)
- isContainerRunning(pod, entry, id): containerID primary, (Name, PodUID)
fallback for pre-running init containers with empty ContainerID
- ctx.Err() honored inside Range callbacks for graceful shutdown
- T8 end-to-end test: user-AP mutation -> cached projection reflects change
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Consensus deltas applied: #1 (isContainerRunning signature), #3 (ctx.Err),
#4 (extend fast-skip to overlay RVs), #5 (T8 test), #7 (RPC-cost comment).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
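The snapshot-then-refresh pattern can be sketched with a plain RWMutex-guarded map standing in for SafeMap; all names below are illustrative, not the actual reconciler code:

```go
package main

import (
	"fmt"
	"sync"
)

// store stands in for the SafeMap: Range holds the read lock, Set takes
// the write lock, so calling Set from inside a Range callback deadlocks.
type store struct {
	mu sync.RWMutex
	m  map[string]int
}

func (s *store) Range(f func(id string)) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for id := range s.m {
		f(id)
	}
}

func (s *store) Get(id string) int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.m[id]
}

func (s *store) Set(id string, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[id] = v
}

// refreshAll mirrors refreshAllEntries: snapshot the IDs under the read
// lock, then rebuild each entry outside Range. Calling Set directly inside
// the Range callback would take the write lock while the read lock is
// still held, which is the RLock/WLock deadlock the commit avoids.
func refreshAll(s *store) {
	var ids []string
	s.Range(func(id string) { ids = append(ids, id) })
	for _, id := range ids {
		s.Set(id, s.Get(id)+1) // safe: no lock held across calls
	}
}

func main() {
	s := &store{m: map[string]int{"c1": 1, "c2": 2}}
	refreshAll(s)
	fmt.Println(s.Get("c1"), s.Get("c2")) // 2 3
}
```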
* feat: profilehelper CP->legacy-shape shims + ContainerProfileCache aggregator wiring (step 6a)
Adds the ContainerProfileCache reader to the ObjectCache aggregator interface
so profilehelper can read CP and synthesize the legacy ApplicationProfileContainer /
NetworkNeighborhoodContainer shapes for callers that haven't migrated yet.
- pkg/objectcache/objectcache_interface.go: add ContainerProfileCache() to
aggregator interface + mock (both AP/NN stay for 6a-6c transit)
- pkg/objectcache/v1/objectcache.go: add cp field, 5-arg NewObjectCache,
ContainerProfileCache() accessor
- pkg/objectcache/v1/mock.go: RuleObjectCacheMock implements CP surface +
Get/SetContainerProfile test helpers, Start stub
- pkg/rulemanager/profilehelper/profilehelper.go:
- GetContainerProfile(objectCache, id) returns (*CP, syncChecksum, error)
— the forward API
- GetContainerApplicationProfile + GetContainerNetworkNeighborhood rewritten
as ~30-LOC CP->legacy-shape shims (consensus delta #2). Marked deprecated;
step 6c deletes them after CEL callers migrate.
- cmd/main.go: construct ContainerProfileCache alongside APC+NNC, pass to
NewObjectCache; mock-path uses ContainerProfileCacheMock
- test call sites updated for 5-arg NewObjectCache
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: migrate 20 CEL call sites to GetContainerProfile (step 6b)
- applicationprofile/{capability,exec,http,open,syscall}.go: read fields
directly off cp.Spec instead of the per-container AP shape
- networkneighborhood/network.go: read Ingress/Egress/LabelSelector off
cp.Spec directly
- pkg/objectcache/v1/mock.go: extend RuleObjectCacheMock so
SetApplicationProfile / SetNetworkNeighborhood also project into the
unified ContainerProfile, and GetContainerProfile honours the shared
container-ID registry (preserves "invalid container ID -> no profile"
semantics for existing tests)
- profilehelper CP->legacy shims remain in place; step 6c removes them
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: delete profilehelper shims + migrate rule_manager + creator (step 6c)
- pkg/rulemanager/profilehelper/profilehelper.go: delete
GetContainerApplicationProfile, GetContainerNetworkNeighborhood,
GetApplicationProfile, GetNetworkNeighborhood, GetContainerFromApplicationProfile,
GetContainerFromNetworkNeighborhood — CP-direct API is the only surface now
- pkg/rulemanager/rule_manager.go:
  - call sites at :202 and :399 now call profilehelper.GetContainerProfile instead of the shim
- HasFinalApplicationProfile reads cp via ContainerProfileCache().GetContainerProfile;
method name preserved (external API on RuleManagerInterface per plan v2 §2.4)
- pkg/rulemanager/rulepolicy.go: Validate takes *v1beta1.ContainerProfile
and reads cp.Spec.PolicyByRuleId
- pkg/rulemanager/ruleadapters/creator.go: both AP + NN branches use
ContainerProfileCache().GetContainerProfileState (unified state source)
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: ObjectCache aggregator CP-only + collapse 2 callbacks to 1 (step 6d)
- pkg/objectcache/objectcache_interface.go: drop ApplicationProfileCache()
and NetworkNeighborhoodCache() methods — the aggregator is now
{K8s, ContainerProfile, Dns}
- pkg/objectcache/v1/objectcache.go: 3-arg NewObjectCache(k, cp, dc)
- pkg/containerwatcher/v2/container_watcher_collection.go:63-64: two
ContainerCallback subscriptions (APC + NNC) collapse to one (CPC)
- cmd/main.go: both branches (runtime-detection + mock) construct only
ContainerProfileCache + Dns; legacy APC/NNC wiring removed with startup
log: "ContainerProfileCache active; legacy AP/NN caches removed"
- test call sites updated for 3-arg NewObjectCache
Legacy packages are still physically present (imports retained where still
referenced, e.g. callstackcache); step 8 deletes them entirely.
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: delete legacy AP/NN cache packages + move callstackcache (step 8)
- git rm -r pkg/objectcache/applicationprofilecache/ (766 LOC)
- git rm -r pkg/objectcache/networkneighborhoodcache/ (758 LOC)
- git rm pkg/objectcache/applicationprofilecache_interface.go
- git rm pkg/objectcache/networkneighborhoodcache_interface.go
- mv pkg/objectcache/applicationprofilecache/callstackcache/
-> pkg/objectcache/callstackcache/ (domain-agnostic, shared)
- Update 4 importers: containerprofilecache_interface.go, v1/mock.go,
containerprofilecache.go, reconciler.go
- RuleObjectCacheMock drops ApplicationProfileCache()/NetworkNeighborhoodCache()
accessor methods; SetApplicationProfile/SetNetworkNeighborhood remain as
test helpers that project into the unified CP
- projection.go comments kept as historical source pointers — git history
preserves the originals
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add T2 init-eviction, T5 packages-deleted, T7 lock-stress (step 9 partial)
- tests/containerprofilecache/packages_deleted_test.go: go/packages
dep-graph assertion that legacy AP/NN paths are absent
- tests/containerprofilecache/lock_stress_test.go: 100-goroutine
interleaved seed/read for same container IDs, 5s budget, race-safe
- tests/containerprofilecache/init_eviction_test.go: T2a (event-path
evict) + T2b (reconciler-path evict for missed RemoveContainer)
- tests/containerprofilecache/helpers_test.go: shared test builders
- pkg/objectcache/containerprofilecache: export ReconcileOnce and
SeedEntryForTest as out-of-package test hooks
- Makefile: check-legacy-packages target
T1 (golden-alert parity) and T3 (memory benchmark) are release-checklist
items per plan v2 §2.7 — the pre-migration baselines those tests require
can no longer be captured from this branch.
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address Phase 4 review P1 findings
1. Drop ReleaseLock on delete paths (containerprofilecache.go deleteContainer,
reconciler.go reconcileOnce). Security review flagged a race where the
deleted mutex could be orphaned while a concurrent GetLock creates a new
one, breaking mutual exclusion for the same container ID. Trade-off:
bounded memory growth of stale lock entries, proportional to container
churn — acceptable for a node-agent lifetime.
2. Extract emitOverlayMetrics helper (metrics.go) to de-duplicate the
~20-line overlay metric/deprecation-warn block between buildEntry
(addContainer path) and rebuildEntry (refresh path). Keeps the two in
lockstep — code review flagged silent drift risk.
Not addressed in this commit (plan-accepted tradeoffs, follow-up work):
- Shared-pointer read-only invariant is convention-enforced, not type-
enforced (plan v2 §2.3 step 7, ADR consequences). Retaining as-is;
downstream consumers must not mutate.
- Storage RPC context propagation (requires storage.ProfileClient interface
change, out of scope for this migration).
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
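The deleted-mutex race behind finding 1 can be illustrated with a minimal per-container-ID lock map; lockRegistry and newLockRegistry are hypothetical names, not the project's actual lock helper:

```go
package main

import (
	"fmt"
	"sync"
)

// lockRegistry hands out one mutex per container ID. If a delete path
// removed r.locks[id] while goroutine A still held the old mutex, a
// concurrent GetLock would mint a fresh mutex and goroutine B could enter
// the "critical" section alongside A. Keeping stale entries trades bounded
// memory growth (proportional to container churn) for correct exclusion.
type lockRegistry struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newLockRegistry() *lockRegistry {
	return &lockRegistry{locks: map[string]*sync.Mutex{}}
}

// GetLock lazily creates, then always returns, the SAME mutex for an ID.
func (r *lockRegistry) GetLock(id string) *sync.Mutex {
	r.mu.Lock()
	defer r.mu.Unlock()
	if l, ok := r.locks[id]; ok {
		return l
	}
	l := &sync.Mutex{}
	r.locks[id] = l
	return l
}

func main() {
	r := newLockRegistry()
	fmt.Println(r.GetLock("c1") == r.GetLock("c1")) // true: stable identity
}
```

Stable mutex identity per ID is exactly what dropping ReleaseLock preserves.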
* fix: retry pending ContainerProfile GETs when CP appears after container-start
Component tests on PR kubescape#788 regressed with "All alerts: []" and 54+
"container X not found in container-profile cache" log entries. Root cause:
addContainer did a one-shot GetContainerProfile at EventTypeAddContainer
time and bailed on 404. But the CP is created asynchronously by
containerprofilemanager ~60s AFTER container-start, so the one-shot GET
almost always missed; the cache entry was never created; rule evaluation
short-circuited as "no profile".
Legacy caches hid this via a periodic ListProfiles scan that picked up
late-arriving profiles on the next tick. The point-lookup model dropped
that scan. This commit adds an equivalent: a pending-container retry path
in the reconciler.
Changes:
- CachedContainerProfile unchanged; new pendingContainer struct captures
(container, sharedData, cpName) needed to retry the initial GET.
- ContainerProfileCacheImpl.pending SafeMap records containerIDs waiting
for their CP to land in storage.
- addContainer extracts the populate/GET into tryPopulateEntry. On miss
(err or nil CP) it records a pending entry; the per-container goroutine
exits. No more waiting 10 min inside addContainerWithTimeout.
- reconciler.retryPendingEntries iterates pending under per-container
locks, re-issues the GET, and promotes via tryPopulateEntry on success.
- reconcileOnce gains a pending GC pass: containers whose pod is gone or
whose status is not Running get dropped from pending so we don't retry
forever on terminated containers.
- deleteContainer also clears from pending on EventTypeRemoveContainer.
- metrics: cache_entries gauge gains a "pending" kind; reconciler
eviction counter gets a "pending_pod_stopped" reason.
Tests:
- TestRetryPendingEntries_CPCreatedAfterAdd: 404 on add -> pending; CP
arrives in storage -> one tick promotes; exactly 2 GetCP calls.
- TestRetryPendingEntries_PodGoneIsGCed: pending entry dropped when its
pod is no longer present in k8s cache.
Full findings and resume doc at
.omc/plans/containerprofile-cache-component-test-findings.md
Follow-up plan updated at
.omc/plans/containerprofile-cache-followups.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
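The pending-retry flow can be sketched as follows; the cache struct, getFn, and method bodies are simplified stand-ins (no locks, metrics, or sharedData), not the real tryPopulateEntry/retryPendingEntries:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("containerprofile not found")

// getFn stands in for the storage GetContainerProfile point lookup.
type getFn func(cpName string) (string, error)

// cache sketches the two maps: populated entries, and containers still
// waiting for their CP to land in storage.
type cache struct {
	entries map[string]string // containerID -> projected profile
	pending map[string]string // containerID -> cpName to retry
}

// addContainer does a one-shot GET; on a miss it records a pending entry
// instead of blocking the add-path goroutine for minutes.
func (c *cache) addContainer(id, cpName string, get getFn) {
	if p, err := get(cpName); err == nil {
		c.entries[id] = p
		return
	}
	c.pending[id] = cpName
}

// retryPending runs on each reconciler tick, re-issues the GET, and
// promotes pending containers once the CP finally exists.
func (c *cache) retryPending(get getFn) {
	for id, cpName := range c.pending {
		if p, err := get(cpName); err == nil {
			c.entries[id] = p
			delete(c.pending, id)
		}
	}
}

func main() {
	c := &cache{entries: map[string]string{}, pending: map[string]string{}}
	miss := func(string) (string, error) { return "", errNotFound }
	hit := func(name string) (string, error) { return "profile-for-" + name, nil }

	c.addContainer("c1", "cp-c1", miss)         // CP not written yet (~60s lag)
	fmt.Println(len(c.entries), len(c.pending)) // 0 1
	c.retryPending(hit)                         // next tick: CP has landed
	fmt.Println(len(c.entries), len(c.pending)) // 1 0
}
```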
* fix: cache correctness — right CP slug, partial-on-restart, overlay refs, resurrection guard
PR kubescape#788 component tests continued failing after the pending-retry fix.
Deep investigation uncovered a fundamental slug misuse and three reviewer-
reported correctness gaps. All fixed here.
### Primary bug: wrong slug function
plan v2 §2.3 asserted that GetOneTimeSlug(false) was deterministic. It is
NOT — implementation at k8s-interface v0.0.206:
func (id *InstanceID) GetOneTimeSlug(noContainer bool) (string, error) {
u := uuid.New()
hexSuffix := hex.EncodeToString(u[:])
...
}
So containerprofilemanager.saveContainerProfile writes a *time-series* CP
per tick with a fresh UUID suffix, and the storage-side
ContainerProfileProcessor.consolidateKeyTimeSeries writes the consolidated
profile at the STABLE slug (GetSlug(false), no UUID).
The cache was querying for CPs at GetOneTimeSlug(false), so every GET 404'd
forever — even with the pending-retry in place. 13 component tests failed
with "All alerts: []" and 38+ "container X not found in container-profile
cache" log entries.
Switched addContainer to GetSlug(false). The refresh path inherits the
corrected name via entry.CPName.
### Reviewer #1: resurrection during refresh
refreshAllEntries snapshots entries without a lock. Between snapshot and
per-entry lock acquisition, deleteContainer or reconcile-evict may have
removed the entry. Previously, rebuildEntry's c.entries.Set(id, newEntry)
would resurrect the dead container.
Added a load-under-lock guard at the top of refreshOneEntry.
### Reviewer #2: overlay handling regressions (two parts)
(a) tryPopulateEntry returned "pending" on base-CP 404 BEFORE trying
user-AP/NN. Containers with only a user-defined profile (no base CP yet)
got no entry. Restructured: fetch base CP and user-AP/NN independently;
populate if ANY source is available; synthesize an empty base CP when only
the overlay exists so projection has something to merge into.
(b) UserAPRef / UserNNRef were only recorded on successful fetch. A
transient 404 on add would permanently drop the overlay intent — the
refresh path had nothing to re-fetch. Now, when the label is set, the
refs are always recorded, using the label's name and the container's
namespace. Refresh retries the fetch each tick.
### Reviewer #3: partial profiles reused across container restart
tryPopulateEntry blindly used whatever CP existed at the stable slug,
including Partial completions from the previous container incarnation.
Legacy caches explicitly deleted Partial profiles on non-PreRunning
restart so rule evaluation fell through to "no profile" until Full
arrived.
Now: if CP.completion == Partial && !sharedData.PreRunningContainer, we
treat the CP as absent → stay pending → retry each tick. When the CP
becomes Full (or the container stops), the pending state resolves.
The inverse is preserved: PreRunningContainer (agent-restart scenario)
accepts the Partial CP as-is so Test_19's "alert on partial profile"
semantics still work.
### Tests
Five new unit tests, all race-clean:
- TestPartialCP_NonPreRunning_StaysPending
- TestPartialCP_PreRunning_Accepted
- TestOverlayLabel_TransientFetchFailure_RefsRetained
- TestRefreshDoesNotResurrectDeletedEntry
- TestUserDefinedProfileOnly_NoBaseCP
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
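The determinism difference at the heart of the slug bug can be shown in miniature. These two helpers are illustrative only and are not the k8s-interface implementations; the point is that a random-suffixed name is fine as a write key for time-series objects but useless as a point-lookup key:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// stableSlug is deterministic: the same workload identity always yields
// the same name, so a point lookup finds what the consolidator wrote.
func stableSlug(ns, kind, name string) string {
	return ns + "-" + kind + "-" + name
}

// oneTimeSlug appends a fresh random hex suffix on every call, mirroring
// the uuid.New() suffix in the real GetOneTimeSlug shown above.
func oneTimeSlug(ns, kind, name string) string {
	b := make([]byte, 16)
	rand.Read(b)
	return stableSlug(ns, kind, name) + "-" + hex.EncodeToString(b)
}

func main() {
	fmt.Println(stableSlug("ns", "deployment", "web") ==
		stableSlug("ns", "deployment", "web")) // true: queryable
	fmt.Println(oneTimeSlug("ns", "deployment", "web") ==
		oneTimeSlug("ns", "deployment", "web")) // false: every GET 404s
}
```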
* fix: read workload-level AP/NN as primary data source
The storage server's consolidated ContainerProfile is not exposed via the
public k8s API — ContainerProfiles().Get(stableName) returns 404 even after
consolidation runs. Only time-series CPs (named <stable>-<UUID>) and the
server-aggregated ApplicationProfile / NetworkNeighborhood CRs at the
workload-name are queryable.
The component tests' WaitForApplicationProfileCompletion waits for the
workload-level AP/NN completion — that's what actually exists. The legacy
caches read these directly; we do the same now while the server-side
consolidated-CP plumbing is completed.
Changes:
- addContainer computes both cpName (per-container, forward-compat) and
workloadName (per-workload, where AP/NN live) via GetSlug(false) and
GetSlug(true) respectively.
- tryPopulateEntry fetches consolidated CP (kept for forward-compat),
workload AP, and workload NN. Treats the workload AP/NN as the primary
data source when the consolidated CP isn't available.
- projection pre-merges workloadAP + workloadNN onto the base (synthesized
when CP is 404), then buildEntry applies user-overlay AP/NN on top.
- Partial-on-restart gate extended to cover workload AP/NN too — non
PreRunning containers ignore partial workload profiles until they
become Full, mirroring legacy deletion-on-restart semantics.
- pendingContainer gains workloadName so retries re-fetch the right CRs.
- fakeProfileClient gains overlayOnly field so tests can scope AP/NN
returns to the overlay name; existing TestOverlayPath_DeepCopies updated
accordingly.
Forward-compat: once storage publishes a queryable consolidated CP at
cpName, its fetch becomes primary and the workload AP/NN path becomes a
fallback. No API changes are required to make that transition — just drop
the workload-level fetches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* debug: add tick-loop start log + change-detection log in reconciler
* fix: remove overly-aggressive pending GC that dropped entries before retry
CI run 24781030436 (commit ce32919) proved the reconciler IS ticking with
retryPendingEntries running, but the pending-GC pass in reconcileOnce was
dropping every pending entry on the first tick (pending_before=4 →
pending_after=0 at the FIRST tick, before retryPendingEntries could run).
Root cause: the GC pass asked k8sObjectCache.GetPod(ns, pod) and also
checked isContainerRunning. On a busy node, the k8s pod cache and
ContainerStatuses lag the containerwatcher Add event by tens of seconds.
So the "pod not found" and "container not yet Running" checks routinely fired
for a container that had just been registered, causing GC to remove the
pending entry immediately. Retries then ran against an empty pending map
→ no promotions → alerts fired without profile → test failure.
Change: remove the pending GC entirely. Cleanup for terminated containers
flows through deleteContainer (EventTypeRemoveContainer) which clears
both entries and pending under the per-container lock. Memory growth is
bounded by the node's container churn (containers that never got a
profile during their lifetime).
Test updated: TestRetryPendingEntries_PodGoneIsGCed replaced by
TestPendingEntriesAreNotGCedBeforeRetry which asserts the new semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: merge user-managed AP/NN and refresh workload-level sources
Two component-test regressions in PR kubescape#788:
Fix A (Test_12 / Test_13): the cache now reads the user-managed
ApplicationProfile and NetworkNeighborhood published at
"ug-<workloadName>" and projects them onto the base profile as a
dedicated ladder pass. Legacy caches did this via the
`kubescape.io/managed-by: User` annotation in handleUserManagedProfile;
we read them directly by their well-known name.
Fix B (Test_17 / Test_19): the reconciler refresh path re-fetches the
workload-level AP/NN (and user-managed / label-referenced overlays) on
every tick, not just the consolidated CP. This propagates the Status=
"ready" -> "completed" transition into the cached ProfileState, which
flips fail_on_profile from false to true at rule-eval time.
CachedContainerProfile gained WorkloadName plus WorkloadAPRV /
WorkloadNNRV / UserManagedAPRV / UserManagedNNRV fields so the refresh
can fast-skip when every source's RV matches what's cached.
refreshOneEntry's rebuild now runs the same projection ladder as
tryPopulateEntry: base CP (or synthesized) → workload AP+NN →
user-managed (ug-) AP+NN → label-referenced user AP+NN.
Also:
- Tick-loop log now fires only when the entries or pending count actually
  changed (previously it fired whenever pending_before > 0, producing
  per-tick noise while a stuck-pending entry waited for profile data).
- fakeProfileClient in tests returns userManagedAP/userManagedNN when
the requested name starts with "ug-".
- New tests: TestWorkloadAPMerged_AndRefreshUpdatesStatus (Fix B
happy-path) and TestUserManagedProfileMerged (Fix A happy-path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: reconcileOnce no longer evicts on pod-cache lag, only on Terminated
CI run 24783250693 (commit 32a76c0) showed reconcileOnce evicting live
entries every tick: "entries_before:10, entries_after:0" within 5 seconds
of the agent starting. Same class of bug as the pending-GC fix (c45803f):
the k8s pod cache lags ContainerCallback Add events by tens of seconds,
and evicting on "GetPod returns nil OR !isContainerRunning" churned every
entry before any rules could evaluate.
Change reconcileOnce eviction gate:
- If pod is missing from k8s cache: DO NOT evict. Cache lag is transient;
deleteContainer handles real-world cleanup via EventTypeRemoveContainer.
- If the pod is present and the container is clearly Terminated: evict
  (preserves init-container eviction for Test_02 and T2 acceptance).
- If the pod is present and the container is in a Waiting state: retain
  (both new-container creation and init-container pre-run legitimately
  pass through Waiting).
New helper isContainerTerminated mirrors isContainerRunning but gates on
State.Terminated only; "not found in any status" is treated as terminated.
Tests:
- TestReconcilerEvictsWhenPodMissing → TestReconcilerKeepsEntryWhenPodMissing
- New TestReconcilerEvictsTerminatedContainer
- New TestReconcilerKeepsWaitingContainer
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
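A sketch of the new gate, using a minimal local stand-in for the k8s ContainerStatus shape (the real helper walks corev1 status slices). Note this encodes this commit's semantics; the empty-status kubelet-lag guard is a later fix:

```go
package main

// containerStatus is a minimal stand-in for corev1.ContainerStatus;
// Terminated mimics State.Terminated != nil.
type containerStatus struct {
	Name       string
	Terminated bool
}

// isContainerTerminated gates eviction on a definitive Terminated state
// only. Waiting and Running containers are retained; per this commit, a
// container absent from every status slice is treated as terminated.
func isContainerTerminated(statuses []containerStatus, name string) bool {
	for _, s := range statuses {
		if s.Name == name {
			return s.Terminated
		}
	}
	return true // not found in any status
}
```

Contrast with the previous gate, which also evicted whenever the pod itself was missing from the lagging k8s cache.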
* refactor: drop workload-level AP/NN fetch; CP-direct reading is authoritative
The workload-level AP/NN fetch added in d27be01 was a workaround for the
eviction/GC bugs (fixed in c45803f and d9ae0ac), not an architectural
need. The consolidated ContainerProfile IS queryable at the GetSlug(false)
name once storage aggregation runs; the cache simply needs to wait on the
pending-retry path.
This reverts the workload-AP/NN read while keeping:
- consolidated CP as the single base-profile source
- user-managed AP/NN at "ug-<workloadName>" (merged on top) — still needed
because user-managed profiles are authored independently and are not
consolidated into the CP server-side
- user-defined overlay via pod UserDefinedProfileMetadataKey label
- eviction fix (d9ae0ac), GC fix (c45803f), resurrection guard
Removed:
- workload-AP/NN fetch in tryPopulateEntry and refreshOneEntry
- WorkloadAPRV / WorkloadNNRV fields on CachedContainerProfile and the
corresponding rebuildEntryFromSources ladder pass
- Partial-on-restart gate for workload AP/NN (only applies to CP now)
- Synth-CP annotation fallback chain (simplified to Completed/Full)
Tests:
- TestWorkloadAPMerged_AndRefreshUpdatesStatus → TestRefreshUpdatesCPStatus
(CP now the source; RV transition propagates Status)
- TestUserManagedProfileMerged rewired to use a real base CP + ug- overlay
instead of workloadAP + ug- overlay
This matches the migration plan's original intent: CP-direct, AP/NN only
as user overlays.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: synthetic entry CPName override, PodUID backfill, phase-labeled reconciler histogram
Three review findings from the post-green audit.
### 1 (High) — synthetic entry stored the wrong CPName
When tryPopulateEntry synthesized a CP (consolidated CP still 404), the
synthetic name was workloadName or overlayName, and buildEntry persisted
entry.CPName = cp.Name (i.e. the synthetic name). refreshOneEntry then
queried the synthetic name instead of the real GetSlug(false) name; with
the stored RV also empty, the fast-skip's "absent matches empty" branch
kept the synthetic entry forever and the real consolidated CP could never
replace it.
Fix: after buildEntry, override entry.CPName = cpName (the real
GetSlug(false) result passed into tryPopulateEntry).
### 2 (Medium) — PodUID never backfilled
buildEntry only sets PodUID when the pod is already in k8sObjectCache at
add time. On busy nodes the pod cache lags, so addContainer often runs
before the pod lands and PodUID stays "". isContainerTerminated's
empty-ContainerID fallback matches against (ContainerName, PodUID);
when PodUID == "" and the status also has empty UID, the loop falls
through and returns true (treat as terminated) — evicting a still-live
init container. rebuildEntryFromSources copied prev.PodUID unchanged, so
the error never healed.
Fix: in rebuildEntryFromSources, if prev.PodUID is empty AND the pod is
now in the k8s cache, use the fresh UID.
### 3 (Low) — reconciler duration histogram mixed two phases
tickLoop (evict) and refreshAllEntries (refresh) both emitted
ReportContainerProfileReconcilerDuration into the same plain Histogram,
so nodeagent_containerprofile_reconciler_duration_seconds was a blend of
two very different workloads. Plan v2 §2.9 had specified a HistogramVec
with a "phase" label from the start.
Fix: MetricsManager.ReportContainerProfileReconcilerDuration(phase, d).
Prometheus implementation becomes a HistogramVec with phase label.
tickLoop emits phase="evict", refreshAllEntries emits phase="refresh".
MetricsMock/MetricsNoop signatures updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address all CodeRabbit review comments on PR kubescape#788
- ContainerProfileCacheMock.GetContainerProfileState returns synthetic
error state instead of nil, matching the real impl's contract
- Remove IgnoreContainer check on EventTypeRemoveContainer to prevent
stale entries when pod labels change after Add
- Deep-copy userAP/userNN in mergeApplicationProfile and
mergeNetworkNeighborhood to eliminate aliasing of nested slices
(Execs[i].Args, Opens[i].Flags, MatchExpressions[i].Values, etc.)
into the cached ContainerProfile
- Fix Shared=true bug: buildEntry now takes userManagedApplied bool;
fast-path only sets Shared=true when no overlay was applied at all,
matching rebuildEntryFromSources logic in reconciler.go
- isContainerTerminated returns false when all status slices are empty
(kubelet lag guard for brand-new pods)
- Fix misplaced doc comment above GetContainerProfile in storage layer
- Remove unused (*stubStorage).setCP test helper
- Lock stress test evict path now uses ContainerCallback(Remove) to
exercise deleteContainer and per-container locking
- RuleObjectCacheMock stores per-container profiles in cpByContainerName;
GetContainerProfile resolves via InstanceID.GetContainerName();
GetContainerProfileState returns synthetic error state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
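The nested-slice aliasing fixed by the deep-copy item can be demonstrated in isolation; execEntry is a hypothetical stand-in for the AP exec entries with their Args slices:

```go
package main

import "fmt"

type execEntry struct {
	Path string
	Args []string
}

// shallow copies only the struct values: each dst Args header still points
// at src's backing array, so later mutation of the source corrupts the
// "cached" copy.
func shallow(src []execEntry) []execEntry {
	dst := make([]execEntry, len(src))
	copy(dst, src)
	return dst
}

// deep clones the nested slices too, which is what the fix does for
// Execs[i].Args, Opens[i].Flags, MatchExpressions[i].Values, etc.
func deep(src []execEntry) []execEntry {
	dst := make([]execEntry, len(src))
	for i, e := range src {
		dst[i] = execEntry{Path: e.Path, Args: append([]string(nil), e.Args...)}
	}
	return dst
}

func main() {
	src := []execEntry{{Path: "/bin/sh", Args: []string{"-c", "id"}}}
	s, d := shallow(src), deep(src)
	src[0].Args[0] = "MUTATED"
	fmt.Println(s[0].Args[0]) // MUTATED: aliased into the source
	fmt.Println(d[0].Args[0]) // -c: independent backing array
}
```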
* feat: thread context.Context through ProfileClient and add per-call RPC budget
All five ProfileClient methods now accept ctx as their first argument so
callers can enforce cancellation and deadline propagation. Each storage RPC
in the reconciler is wrapped via refreshRPC(ctx, ...) which applies a
configurable per-call timeout (config.StorageRPCBudget, default 5 s) on top
of the parent context, preventing a slow API server from stalling an entire
reconciler burst. Tests cover the fast-skip, rebuild, and context-cancellation
mid-RPC paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test: shared-pointer race-fuzz test + WarmContainerLocksForTest helper
Add TestSharedPointerReadersDoNotCorruptCache: 50 concurrent readers
traverse the returned *ContainerProfile slices while a writer goroutine
alternately calls RefreshAllEntriesForTest + SeedEntryForTest to keep
entry rebuilds active. Runs for 500ms under -race, proving the shared-
pointer fast-path never produces a concurrent read/write pair.
Also add TestSharedPointerFastPathPreservesPointerIdentity: after a
refresh against a storage object with a newer RV, the new entry's
Profile pointer IS the storage object (Shared=true, no DeepCopy), which
keeps the T3 memory budget intact.
Fix the pre-existing goradd/maps SafeMap initialisation race in
TestLockStressAddEvictInterleaved by pre-warming containerLocks via the
new WarmContainerLocksForTest helper (the previous pre-warm via
SeedEntryForTest only covered the entries SafeMap, not containerLocks).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: document SetApplicationProfile / SetNetworkNeighborhood field partition in mock
Add a block comment above RuleObjectCacheMock spelling out the non-overlapping
cp.Spec field partition between the two setters and the first-container-wins
rule for r.cp. Without this, future callers risk aliasing NN fields into an
AP-only profile or vice-versa.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: T8 integration mirror, mock setter contract doc, SeedEntryWithOverlayForTest
Add SeedEntryWithOverlayForTest helper so out-of-package integration tests can
set UserAPRef / UserNNRef (which use the internal namespacedName type) without
requiring the type to be exported.
Mirror TestT8_EndToEndRefreshUpdatesProjection at tests/containerprofilecache/
using only the public + test-helper API: seeds an entry with a stale UserAPRV,
mutates storage to apV2 (RV=51), asserts RefreshAllEntriesForTest rebuilds the
projection with the new execs and drops the stale ones.
Add top-of-file block comment to RuleObjectCacheMock documenting the non-
overlapping AP-fields / NN-fields partition between SetApplicationProfile and
SetNetworkNeighborhood and the first-container-wins rule for r.cp.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address Phase 4 code-review findings
- reconciler.go: simplify dead-code cpErr/rpcErr guard (refreshRPC returns
exactly cpErr; the rpcErr != nil && cpErr == nil branch could never fire)
- reconciler_test.go: make blockingProfileClient.blocked a buffered chan(1)
with a blocking send so the signal is never silently dropped; bump
rpcBudget to 100ms and timeout to 2s to reduce flakiness on loaded CI
- containerprofilecache.go: extract defaultStorageRPCBudget const alongside
defaultReconcileInterval for discoverability
- shared_pointer_race_test.go: fix gofmt const-block alignment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: preserve cached entry when overlay AP/NN fetch fails transiently
Before this fix, a refreshRPC timeout on any overlay GET (user-managed
ug-<workload> AP/NN or user-defined label-referenced AP/NN) left the
overlay variable nil with the error silently discarded. The RV comparison
then saw rvOf(nil)="" != cached RV (e.g. "50"), treated it as a removal,
and rebuilt the entry without the overlay — temporarily stripping
user-managed/user-defined profile data from the cache and altering
alerting until the next successful tick.
Fix: capture each overlay's fetch error and, when it is non-nil and the
entry already has a non-empty cached RV for that overlay, return early
and keep the existing entry unchanged. Legitimate deletions (nil with
err==nil) still propagate correctly. Mirrors the existing CP error-
preservation logic at refreshOneEntry:272-288.
Add TestRefreshPreservesEntryOnTransientOverlayError covering all four
overlay fetch paths (user-managed AP, user-managed NN, user-defined AP,
user-defined NN) via a new overlayErrorClient stub.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address CodeRabbit review issues on PR kubescape#788
- Rename 5 CP cache metrics from nodeagent_* to node_agent_* to match
the existing metric namespace convention used across node-agent.
- Route all 5 storage GETs in tryPopulateEntry through refreshRPC so
they respect the per-call SLO (default 5s); prevents a hung GET from
blocking the entire reconciler tick loop when called from
retryPendingEntries.
- Add WarmPendingForTest helper to pre-initialise the pending SafeMap
before concurrent test phases, preventing the goradd/maps
nil-check-before-lock initialisation race.
- Pre-warm pending SafeMap in TestLockStressAddEvictInterleaved and
poll for async deleteContainer goroutines to drain before asserting
goroutine count.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: distinct RNG seed per stress-test worker
Pass worker index into each goroutine closure and mix it into the
rand.NewSource seed (time.Now().UnixNano() + int64(worker)), so that
100 concurrently-launched goroutines don't all receive the same
nanosecond timestamp and end up with identical add/evict sequences.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: move test helpers out of production source into testing.go
The six *ForTest / ReconcileOnce helpers were previously mixed into
containerprofilecache.go alongside production logic. Move them to a
dedicated testing.go file in the same package.
export_test.go is the idiomatic alternative but is compiled only when
running tests in the same directory; test packages in other directories
(tests/containerprofilecache/) import the non-test version of the
package and never see _test.go contents. A plain testing.go is the
correct pattern here — it signals "test support" by name and groups all
scaffolding in one place, while remaining importable by any test binary.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: move integration tests into package dir; use export_test.go
export_test.go (package containerprofilecache) is only compiled during
`go test` so test helpers never enter the production binary. This only
works when callers are in the same directory; the prior layout put tests
in tests/containerprofilecache/ (a separate package), forcing helpers
into a plain testing.go that shipped in the binary.
Moving the six test files into pkg/objectcache/containerprofilecache/
as package containerprofilecache_test fixes this correctly:
- export_test.go replaces testing.go (test-binary-only)
- package declaration: containerprofilecache_integration → containerprofilecache_test
- packages_deleted_test.go Dir path: ../.. → ../../.. (module root)
- tests/containerprofilecache/ directory removed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: nil out overlay pointers when k8s client returns zero-value on 404
The Kubernetes generated client (gentype.Client.Get) pre-allocates a
zero-value struct before the HTTP call and returns it as the result even
on error (e.g. 404 not-found). In refreshOneEntry, the four overlay
fetch paths (userManagedAP, userManagedNN, userAP, userNN) guarded only
the "transient error with cached RV → keep old entry" branch; the
"first-time 404, no cached RV" branch fell through with a non-nil
empty-ObjectMeta struct still in the pointer, which reached
rebuildEntryFromSources → emitOverlayMetrics and logged spurious
"user-authored legacy profile merged" warnings with empty
namespace/name/resourceVersion fields.
Add an explicit nil-out after each non-returning error branch, mirroring
the pattern already used in tryPopulateEntry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
…vice file handling (kubescape#791) Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
…kubescape#798)
* perf: disable file-digest/metadata/executable catalogers
These three catalogers iterate every file in the scan tree and dominate
transient allocation, but their outputs are not consumed by the OOM-relevant
SBOM path. Disabling them saves ~200 MB peak RSS on gitlab-ee (main) and
stacks with upstream selective-indexing + binary-prefilter improvements to
~1.12 GB total (vs 1.62 GB baseline; fits the 1.5 GB cgroup).
Signed-off-by: Ben <ben@armosec.io>
* deps: switch to kubescape/syft v1.32.0-ks.2 for memory reduction
Routes anchore/syft imports to the kubescape fork via a replace directive.
The fork carries selective indexing + binary-cataloger pre-filtering on top
of v1.32.0; combined with the file-cataloger disable in the parent commit,
this reduces gitlab-ee scan peak RSS from 1,621 MB to 1,123 MB.
Refs: NAUT-1283
Signed-off-by: Ben <ben@armosec.io>
* fix: check dep.Replace for actual fork version; add cataloger removals to sidecar
- packageVersion() now returns dep.Replace.Version when present so the fork
  tag (v1.32.0-ks.2) propagates to runtime metadata and version-gating logic
- pkg/sbomscanner/v1/server.go: add the same WithCatalogerSelection/
  WithRemovals as sbom_manager.go so both SBOM paths drop
  file-digest/metadata/executable catalogers and stay consistent in memory
  behaviour
Signed-off-by: Ben <ben@armosec.io>
* fix: keep syft tool version at required version
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eCache
Accepts upstream PR kubescape#788 deletion of the legacy workload-keyed
caches pkg/objectcache/applicationprofilecache/ and
pkg/objectcache/networkneighborhoodcache/ in favor of the new
container-keyed ContainerProfileCache.
Conflict resolutions:
- 4 modify/delete conflicts (legacy cache .go + _test.go pairs): accept
  upstream deletes.
- cmd/main.go: take the upstream version, then re-apply the fork-only
  cfg-injection on the NewRulesWatcher call site (4-arg signature that was
  added on the fork).
- go.mod / go.sum: keep the fork's replace directive for
  k8sstormcenter/storage; bump k8s-interface v0.0.206 -> v0.0.208 to pick up
  helpersv1.LearningPeriodMetadataKey introduced in upstream PR kubescape#797.
NOT in this merge — separate follow-up commits required:
- Port the user-defined NetworkNeighborhood feature (fork commit fea3b06)
  onto containerprofilecache.CachedContainerProfile.UserNNRV.
- Port the user-defined ApplicationProfile feature onto the UserAP overlay
  slot in the same struct.
- Re-apply CompareExecArgs wiring on whatever wasExecutedWithArgs looks like
  after CP-cache integration (feat/exec-arg-wildcards).
Build green; unit tests green across pkg/objectcache, pkg/rulemanager,
pkg/containerprofilemanager. Component tests deferred until UDP features
are ported.
entlein pushed a commit that referenced this pull request on Apr 30, 2026
Upstream PR kubescape#788 (Replace AP and NN cache with CP) deleted the
legacy applicationprofilecache where the fork's emitTamperAlert (commit
c2d681e 'Feat/tamperalert' #22) lived. After the merge, R1016 alerts no
longer fired for tampered user-defined profiles, breaking
Test_31_TamperDetectionAlert (passed 3/3 on main, fails on the merged branch
— confirmed regression introduced by PR #36).
This restores the contract: every cache load of a user-supplied
ApplicationProfile or NetworkNeighborhood overlay re-verifies the signature,
and emits an R1016 'Signed profile tampered' alert through the rule-alert
exporter when the signature is present but no longer valid. Alert shape is
preserved from the legacy cache so dashboards and component tests keep
matching.
Implementation:
- new file pkg/objectcache/containerprofilecache/tamper_alert.go:
  verifyUserApplicationProfile / verifyUserNetworkNeighborhood /
  emitTamperAlert / extractWlidFromContainerID. Self-contained; keeps the
  containerprofilecache.go diff small.
- containerprofilecache.go: new tamperAlertExporter field +
  SetTamperAlertExporter setter + verify hooks immediately after
  GetApplicationProfile / GetNetworkNeighborhood succeed in the user-overlay
  branch of addContainer.
- cmd/main.go: wire the alert exporter via SetTamperAlertExporter after the
  cache constructor (constructor signature kept unchanged to avoid blast
  radius on tests). The setter is nil-safe: when no exporter is wired,
  verification still runs and is logged but no alert is emitted — matches
  the legacy behavior for unit tests with no exporter.
Test_31 expanded from one scenario to four subtests, each in its own
namespace to avoid alert cross-contamination:
- 31a tampered_user_defined_AP_fires_R1016 — original regression case
- 31b untampered_signed_AP_no_R1016 — negative: clean signature
- 31c unsigned_AP_no_R1016 — signing is opt-in
- 31d tampered_user_defined_NN_fires_R1016 — parallel NN code path
Verified locally:
- go build ./... clean
- go test ./pkg/objectcache/containerprofilecache/... green
- go test ./pkg/signature/... green
- go vet -tags=component ./tests/... clean
- go test -tags=component -run='^$' ./tests/... compiles
entlein added a commit that referenced this pull request on May 10, 2026
* test(component): port Test_28 to upstream's unified user-overlay label
Upstream PR kubescape#788 (Replace AP and NN cache with CP) collapsed the
two legacy workload-keyed caches into a single ContainerProfileCache that
reads ONE pod label `kubescape.io/user-defined-profile=<name>` and uses
<name> as the lookup key for BOTH the user ApplicationProfile and the user
NetworkNeighborhood. The fork's earlier two-label scheme
(`user-defined-profile` for AP + separate `user-defined-network` for NN,
with potentially different names) is no longer honored — the second label is
silently ignored.
Port:
- tests/resources/nginx-user-defined-deployment.yaml: drop the
  `user-defined-network` label, point the surviving label at one shared name
  `curl-28-overlay`.
- tests/component_test.go Test_28_UserDefinedNetworkNeighborhood: create
  both AP and NN under that single shared name.
Assertions unchanged — the test still verifies that the user NN's egress
restriction (only fusioncore.ai allowed on TCP/80) is enforced once the pod
is running.
Verified locally: go vet -tags=component ./tests/... clean;
go test -tags=component -run='^$' ./tests/... compiles cleanly.
* feat(cel): re-port CompareExecArgs hookup onto upstream's CP cache
PR #35's wildcard-aware exec arg matching needs reapplication on top of the
new ContainerProfileCache (upstream kubescape#788) baseline. The original PR
sat on the legacy applicationprofilecache, which has been deleted; the call
site now reads cp.Spec.Execs from a ContainerProfile.
Same semantic change as PR #35:
- '⋯' (DynamicIdentifier) — matches exactly one argument position.
- '*' (WildcardIdentifier) — matches zero or more consecutive args.
Wiring:
- pkg/rulemanager/cel/libraries/applicationprofile/exec.go: drop
  slices.Compare exact-equality on the cp.Spec.Execs loop; route through
  dynamicpathdetector.CompareExecArgs.
- go.mod: bump fork storage replace to feat/exec-arg-wildcards tip
  (3fc287210729), which carries the matcher.
- exec_test.go: re-add TestExecWithArgsWildcardInProfile (13 subtests across
  curl --user ⋯, sh -c *, ls -l ⋯, echo hello *, plus negative
  literal-anchor / under-consumed-⋯ / mid-profile-* cases). Mirrors the test
  set that lived on PR #35 before the upstream merge.
Verified: full applicationprofile package green
(`go test ./pkg/rulemanager/cel/libraries/applicationprofile/`).
* feat(rules): R0040 'Unexpected process arguments' + Test_32 e2e
R0040 is an additive companion to R0001. It evaluates:
  ap.was_executed(...) && !ap.was_executed_with_args(..., event.args)
so it ONLY fires when the exec'd path IS in the user-defined profile (R0001
stays silent) but the runtime arg vector does not match any profile entry's
pattern. With wildcard tokens supported by
dynamicpathdetector.CompareExecArgs:
- '⋯' (DynamicIdentifier) — exactly one argument position.
- '*' (WildcardIdentifier) — zero or more consecutive args.
Use case: profile entry {Path: /bin/sh, Args: [sh, -c, *]} flags 'sh -x ...'
as drift while permitting 'sh -c <anything>'.
Wiring:
- tests/chart/templates/node-agent/default-rules.yaml: new R0040 CEL rule
  definition immediately after R0001, same MITRE tagging (TA0002/T1059) and
  same applicationprofile-anomaly tag set.
- tests/chart/templates/node-agent/default-rule-binding.yaml: R0040 added to
  the all-rules-all-pods binding next to R0001.
- tests/resources/curl-exec-arg-wildcards-deployment.yaml: new fixture, curl
  pod labelled with the unified
  kubescape.io/user-defined-profile=curl-32-overlay label.
- tests/component_test.go: Test_32_UnexpectedProcessArguments with 4
  subtests:
  32a sh_dash_c_matches_wildcard_trailing — sh -c <cmd> matches profile
      [sh, -c, *] — R0040 silent.
  32b sh_dash_x_mismatches_R0040 — sh -x <cmd> mismatches the literal -c
      anchor — R0040 fires.
  32c echo_hello_matches_wildcard_trailing — echo hello world matches
      [echo, hello, *] — R0040 silent.
  32d echo_goodbye_mismatches_R0040 — echo goodbye world mismatches the
      literal hello anchor — R0040 fires.
Verified locally: go vet -tags=component ./tests/... clean;
go test -tags=component -run='^$' ./tests/... compiles cleanly. End-to-end
alert assertions run in CI.
* deps(storage): bump to rebased feat/exec-arg-wildcards tip (0de34ebc)
After rebasing storage feat/exec-arg-wildcards onto storage main, the
matcher branch now sits on top of the latest fork main commit (352395a3 —
Internal-field merge fix). Bump the node-agent storage replace to that new
pseudo-version so this branch's tests run against storage main + matcher in
one consistent baseline.
Verified locally: 47/47 non-eBPF unit packages green; vet clean; the
applicationprofile CEL package's TestExecWithArgsWildcardInProfile is 13/13
green; component tests compile under the component tag. The two failing
packages (pkg/containerwatcher/v2/tracers and pkg/validator) fail with the
same pre-existing /sys/fs/bpf mount-permission error they have on every
recent run — env, not code.
* ci(component-tests): add Test_32_UnexpectedProcessArguments to matrix
The component-tests workflow uses a hardcoded matrix list, not dynamic
discovery from the test source. Test_32 (added in a613cf6) must be listed
explicitly to be picked up — without this entry the test is silently
skipped.
* fix(containerprofilecache): re-wire R1016 tamper alert + expand Test_31
Upstream PR kubescape#788 (Replace AP and NN cache with CP) deleted the
legacy applicationprofilecache where the fork's emitTamperAlert (commit
c2d681e 'Feat/tamperalert' #22) lived. After the merge, R1016 alerts no
longer fired for tampered user-defined profiles, breaking
Test_31_TamperDetectionAlert (passed 3/3 on main, fails on the merged
branch — confirmed regression introduced by PR #36).
This restores the contract: every cache load of a user-supplied
ApplicationProfile or NetworkNeighborhood overlay re-verifies the
signature, and emits an R1016 'Signed profile tampered' alert through the
rule-alert exporter when the signature is present but no longer valid.
Alert shape preserved from the legacy cache so dashboards and component
tests keep matching.
Implementation:
- new file pkg/objectcache/containerprofilecache/tamper_alert.go:
  verifyUserApplicationProfile / verifyUserNetworkNeighborhood /
  emitTamperAlert / extractWlidFromContainerID. Self-contained; keeps the
  containerprofilecache.go diff small.
- containerprofilecache.go: new tamperAlertExporter field +
  SetTamperAlertExporter setter + verify hooks immediately after
  GetApplicationProfile / GetNetworkNeighborhood succeed in the
  user-overlay branch of addContainer.
- cmd/main.go: wire the alert exporter via SetTamperAlertExporter after the
  cache constructor (constructor signature kept unchanged to avoid blast
  radius on tests). The setter is nil-safe: when no exporter is wired,
  verification still runs and is logged but no alert is emitted — matches
  the legacy behavior for unit tests with no exporter.
Test_31 expanded from one scenario to four subtests, each in its own
namespace to avoid alert cross-contamination:
- 31a tampered_user_defined_AP_fires_R1016 — original regression case
- 31b untampered_signed_AP_no_R1016 — negative: clean signature
- 31c unsigned_AP_no_R1016 — signing is opt-in
- 31d tampered_user_defined_NN_fires_R1016 — parallel NN code path
Verified locally:
- go build ./... clean
- go test ./pkg/objectcache/containerprofilecache/... green
- go test ./pkg/signature/... green
- go vet -tags=component ./tests/... clean
- go test -tags=component -run='^$' ./tests/... compiles
* test(component): Test_32 profile uses full-path argv[0]
Empirical finding from CI run 25178930763 — Test_32's positive subtests
(32a sh_dash_c_matches, 32c echo_hello_matches) fired R0040 when they
should not. Cause: at runtime, the eBPF tracer captures argv[0] as the FULL
exec path (e.g. "/bin/sh") rather than the basename ("sh"). My profile
entries used basenames, so the matcher's first-position literal compare
missed and the cache fell through to 'no exec entry matches' — R0040 fires.
Aligns Test_32's profile with the convention already used by Test_27's
wildcard_yaml_profile_allowed_opens fixture
(known-application-profile-wildcards.yaml predecessor): argv[0] is the full
path, subsequent positions are flags/values.
Subtest expectations after this fix:
- 32a sh -c <cmd> → matches [/bin/sh, -c, *] → R0040 silent
- 32b sh -x <cmd> → -c anchor mismatch → R0040 fires
- 32c echo hello <…> → matches [/bin/echo, hello, *] → R0040 silent
- 32d echo goodbye <…> → hello anchor mismatch → R0040 fires
* test: AP-fixture linter (R-AP-* rules) + canonical reference profile
Catches the class of bug Test_32 hit on its first CI run (PR #37 run
25178930763): profile entries used basename argv[0] ("sh") while the eBPF
tracer captures the full path ("/bin/sh"), so the matcher silently misses
and the rule fires when it shouldn't. Without a linter, this kind of
fixture drift only surfaces in a 15-minute component-test run on a kind
cluster — too late, too expensive.
The linter (LintApplicationProfile / LintApplicationProfileYAML in
tests/resources/aplint_test.go) is intentionally written as a pure function
returning []Violation. Zero testing-package coupling on the hot path, so it
can be lifted into a future bobctl subcommand `bobctl lint <ap.yaml>`
without a rewrite — see backlog at ~/biz/sbob-business-plan/state.yaml.
Rules:
- R-AP-01 — kind must be ApplicationProfile
- R-AP-02 — at least one container
- R-AP-03 — container.name non-empty
- R-AP-10 — exec.path absolute (catches relative paths)
- R-AP-11 — exec.path no wildcards (binary identity is exact)
- R-AP-12 — exec.args[0] equals exec.path or a wildcard token
  (the Test_32-style argv[0] basename trap)
- R-AP-13 — exec.args wildcard tokens are whole-word (no embedding)
- R-AP-20 — open.path non-empty + absolute
- R-AP-21 — open.flags non-empty (real auto-recorded opens always have ≥1)
- R-AP-22 — open.flags from known O_* set (catches typos)
Each rule has a dedicated self-test that constructs a minimal-bad YAML and
asserts the rule fires (5 negative tests). One positive test
(TestLinter_canonical_AP_passes) parses the fork's reference
known-application-profile.yaml — extracted from a real auto-recorded AP for
curlimages/curl:8.5.0 in fea3b06 — and asserts zero violations. The
reference YAML is restored to tests/resources/ so the canonical shape is
in-tree and visible to humans + CI.
Why a Go test rather than a shell linter: it keeps the rule set in the same
language as the storage matcher (`dynamicpathdetector`), so extending
CompareExecArgs and the linter together stays cheap.
Local-cluster organic learning was the original plan, but k3s on OrbStack
is currently flapping (LXC-related boot loop). The fea3b06 profile was
extracted from real auto-learning at an earlier moment of stability, which
is the next-best ground truth.
* fix(tamper_alert): accept self-signed profiles, only flag actual tamper
Switch verifyUser{ApplicationProfile,NetworkNeighborhood} from strict
VerifyObject to VerifyObjectAllowUntrusted. The strict variant requires a
Sigstore Fulcio trust chain and rejects locally-signed profiles even when
the signature against the embedded cert is valid. That made Test_31b
'untampered_signed_AP_no_R1016' fire R1016 against an untampered AP, and
broke Test_30's 'tampered_profile_loaded_without_enforcement' subtest the
same way.
The intent is: tamper detection, not trust-chain enforcement. Matches
cmd/sign-object/main.go's default verifier.
* test(component): make Test_30 30b deterministic by re-execing inside Eventually
The single-shot wget exec before Eventually was racy: if the eBPF event
landed before the CP cache projected the user-defined AP, the rule manager
evaluated against an empty baseline and R0001 never fired within the 60s
polling window. Same race Test_29 already documents.
Drive the wget exec inside the Eventually loop (10s tick, 120s deadline) so
cache-load latency is absorbed by retries. Filter R0001 to comm=wget to
make the assertion specific instead of catching any R0001. Drops the blind
15s pre-sleep and the redundant settle-then-recount block.
* deps(storage): bump replace to f44fed80 (analyzer trailing-* fix)
Picks up the upstream-PR-kubescape#316 review fix: trailing
WildcardIdentifier now requires at least one regular-path segment, matching
standard glob semantics. Closes the R0002 blind spot where '/etc/*' would
silently match the bare '/etc' directory.
* deps(storage): bump replace to 4ab95fb8 (PR #25 merged on fork main)
Pulls in the full PR-kubescape#316 review fix set that just landed on
storage main: proper splitPath-based trailing-* anchoring,
DefaultCollapseConfigs() defensive-copy accessor, FindConfigForPath
value-return, splitEndpoint defensive guard, plus the
BenchmarkCompareDynamic baseline.
* test(component): Test_33_AnalyzeOpensWildcardAnchoring
End-to-end pin of the storage-side CompareDynamic contract through R0002.
Each subtest deploys a fresh nginx pod with a user-defined AP carrying ONE
Opens entry, then `cat`s a target path that probes a boundary case from the
storage analyzer fixes (kubescape/storage kubescape#316 review by matthyx +
entlein):
- An anchored trailing `*` matches one OR MORE remaining segments — never
  zero. So /etc/* matches /etc/passwd but NOT bare /etc.
- DynamicIdentifier (⋯) consumes EXACTLY ONE segment.
- Mid-path `*` is zero-or-more, so /etc/*/* matches /etc/ssh (inner *
  consumes zero, trailing * consumes one).
- Mixed ⋯/* combinations: ⋯ pins one, * consumes the rest.
- splitPath normalises trailing slashes on both sides.
11 subtests covering:
- trailing_star_matches_immediate_child — basic /etc/* match
- trailing_star_matches_deep_child — multi-segment under prefix
- trailing_star_does_not_match_bare_parent — the security fix
- deep_prefix_trailing_star_does_not_match_parent — same rule, deeper
- ellipsis_pin_one_segment_then_literal — ⋯ rejects zero
- ellipsis_then_trailing_star_matches_two_* — ⋯/* combo, 2 levels
- ellipsis_then_trailing_star_matches_three_* — ⋯/* combo, 3 levels
- double_trailing_matches_one_child — /*/* mid-zero
- double_trailing_matches_deep_child — /*/* mid-one
- double_trailing_does_not_match_parent — /*/* needs ≥1 child
- trailing_slash_in_profile_normalises_to_literal — splitPath on profile
Pinned at component level on TOP of the unit suite in
storage/pkg/registry/file/dynamicpathdetector/tests/coverage_test.go. Both
layers must agree — a drift in either lights up R0002 with a false positive
or false negative. Matrix entry added to component-tests.yaml so the test
runs in CI.
* test(component): rework Test_33 negative cases to probe under R0002's monitored prefix
R0002's CEL ruleExpression has a strict path-prefix filter:
  event.path.startsWith('/etc/') || event.path.startsWith('/var/log/') || ...
All with a trailing slash. Bare /etc and /var/log don't pass the filter, so
R0002 never evaluates on those events — the matcher's bare-parent anchoring
contract stays invisible at runtime even though the storage unit tests pin
it.
Probe one level deeper instead: /etc/ssl IS under the /etc/ monitored
prefix, so the rule CAN see whether a /etc/ssl/* profile entry matches the
bare /etc/ssl parent. Same security guarantee, observable layer.
Reworked subtests:
- trailing_star_does_not_match_bare_parent_under_monitored_prefix: profile
  /etc/ssl/*, cat /etc/ssl → R0002 fires
- deep_prefix_trailing_star_does_not_match_parent: profile
  /etc/ssl/certs/*, cat /etc/ssl/certs → R0002 fires
- ellipsis_requires_one_segment_not_zero: profile /etc/passwd/⋯,
  cat /etc/passwd → ⋯ requires one more segment
- double_trailing_does_not_match_parent_under_monitored_prefix: profile
  /etc/ssl/*/*, cat /etc/ssl → R0002 fires
The 7 positive subtests that already passed are untouched. Added a comment
block documenting why we probe at /etc/ssl rather than /etc.
* test(component): fix Test_28 + Test_31 31b flakiness
Two distinct fixes for what looked like the same intermittent failures
across PR #37 runs:
Test_31 31b 'untampered_signed_AP_no_R1016' — root cause: storage's PreSave
runs DeflateSortString on Syscalls (and Capabilities, Architectures), which
sorts + dedupes. The signSignedAP helper signed the AP BEFORE pushing,
against unsorted syscalls {socket, connect, read, write, close, openat}.
After PreSave the stored AP had sorted {close, connect, openat, read,
socket, write}, so the content hash differed from the signature →
server-side verify correctly failed → R1016 fired even though the profile
was untampered. Test_29 + Test_30 30b had the same fixture but didn't
observe the bug because they only assert R0001 counts, never R1016.
Pre-sort the syscalls in all three test fixtures so storage's normalization
is a no-op on round-trip.
Test_28 28a 'allowed_fusioncore_no_alert' — root cause: the 15s post-deploy
sleep wasn't always enough for the upstream ContainerProfileCache to
project the user-defined NN. The failure mode is an alert payload with
profileMetadata.errorMessage: "waiting for profile update" — the rule
manager evaluated against an unloaded NN and fired R0005/R0011 spuriously.
Bumped to 30s with a comment documenting why. A real fix would poll a
cache-loaded signal, but no such signal is exposed from outside the
node-agent today.
* test(component): sign-after-roundtrip in Test_31 to defeat content-drift R1016 false positive
Test_31 31b 'untampered_signed_AP_no_R1016' kept flaking because the AP's
content hash drifted between client-side sign and server-side verify across
the K8s/storage roundtrip. Sources of drift include storage's PreSave
normalisation (DeflateSortString, DeflateStringer, DeflateRulePolicies),
signature/profiles GetContent's nil→empty-map mutation on PolicyByRuleId,
and any K8s server-side defaulting of spec/metadata fields. Pre-sorting
Syscalls in the previous fix only covered one of these.
Sign-after-roundtrip closes the whole class:
1. Push the AP UNSIGNED to storage. PreSave runs, normalises content.
2. Read it back — this is what node-agent will see at verify time.
3. Sign THAT normalised content.
4. Delete the unsigned in-storage copy so deployAndWait can Create the
   signed version without an AlreadyExists conflict.
5. Strip server-managed metadata (resourceVersion / uid / etc.) from the
   returned AP so the second Create succeeds cleanly.
The second push goes through deflate again. Idempotent on already-normalised
content → stored bytes identical to signed bytes → content hash matches →
verify succeeds → no R1016 false positive.
Tampered subtests (31a, 31d) keep working: signSignedAP returns a
known-good signed AP, the test mutates it post-helper, deployAndWait
Creates the mutated version, the storage round-trip preserves the mutation,
and verify correctly detects the divergence.
* test(component): bump Test_33 WaitForReady to 180s for cluster-pressure tolerance
Test_33 deploys 11 fresh pods sequentially, one per subtest. Later subtests
race against an increasingly loaded kind cluster — CP cache reconciler,
alertmanager, prometheus all chew CPU at boot. The 80s WaitForReady
deadline timed out on the post-23ea224 run with 'workload not ready in ns
...' for early subtests once the cluster got busy. 180s gives headroom
without changing the total runtime regime.
* deps(storage): bump replace to 43795bb4 (storage feat/exec-arg-wildcards tip — CRDs + CompareExecArgs)
* test(aplint): drop redundant p := p loop var (Go 1.22+, copyloopvar lint)
CodeRabbit (PR #38). Loop-variable shadowing is no longer required since Go
1.22 made each iteration capture its own variable. The shadow trips
golangci-lint's copyloopvar check.
* fix(tamper_alert): R1016 dedup + use real WLID
CodeRabbit (PR #38) flagged two issues on the refresh-loop tamper path:
1. Dedup R1016 (Major). The cache refresh re-runs verifyUser*() on every
   reconcile cycle, so a long-lived tampered profile would emit R1016 on
   every cycle and once per container reference. Track emitted alerts in a
   sync.Map keyed on (kind|ns/name@resourceVersion). LoadOrStore gives
   atomic 'first transition to invalid' semantics; a re-tamper at a new RV
   uses a fresh key and alerts again. Verification passing clears the key
   so future tampers re-alert.
2. Pass real WLID (Major). emitTamperAlert previously called
   extractWlidFromContainerID(containerID) — but containerID is a runtime
   ID like 'containerd://...' which GenericRuleFailure.SetWorkloadDetails
   parses as a WLID and silently drops kind/name/cluster attribution.
   Thread sharedData.Wlid (already in scope at the call site in
   containerprofilecache.go) through verifyUser*() into emitTamperAlert.
   Drop extractWlidFromContainerID — no longer needed.
Test_31_TamperDetectionAlert exercises this path end-to-end; it is expected
to keep passing (one alert per tampered profile, with proper workload
attribution in Alertmanager).
* deps(storage): bump replace to b0d68d3d (empty-Args wildcard match)
Picks up storage's CompareExecArgs fix for path-only Execs entries
(CodeRabbit PR #38, finding #3). With this bump, R0040 no longer fires on
every legit exec of a path that the user-defined profile allowed but didn't
constrain by argv.
* fix: address CodeRabbit second-review batch on PR #38 Three CR concerns addressed in one commit (all submitted 2026-05-04 10:09): 1. tamper_alert.go (Major) — only emit R1016 on actual signature mismatch. Previously every error from VerifyObjectAllowUntrusted (hash computation, verifier construction, malformed annotations, signature mismatch) was treated as a tamper event. With EnableSignatureVerification=true that meant a transient operational failure could drop a valid overlay AND emit a false R1016. Fix: - Add signature.ErrSignatureMismatch sentinel in pkg/signature/annotations.go - Wrap the signature-fail return in verify.go with the sentinel (Go 1.20+ multiple-%w form) - Classify in tamper_alert.go via errors.Is(err, ErrSignatureMismatch); operational errors log with a 'NOT tamper' tag, do NOT touch the tamperEmitted dedup map, and do NOT emit R1016. They still honour strict-mode (return !EnableSignatureVerification) so behaviour is conservative. Adds tamper_alert_test.go: pins LoadOrStore semantics, confirms each operational-error variant returns false on errors.Is(ErrSignatureMismatch), smoke-tests the sentinel value. 2. aplint_test.go:221 (Minor) — replace strings.Contains substring check for fixture-type detection with proper YAML-header parse. The substring form silently skipped fixtures with quoted/non-canonical 'kind' values AND would false-positive on nested OwnerReferences with the same substring. Now uses sigs.k8s.io/yaml (already imported) to unmarshal into a {Kind string} struct and branch on the parsed value. 3. aplint_test.go R-AP-12 (Major, 'Heavy lift') — REJECTED with reasoning to be posted as CR reply. 
Evidence shows R-AP-12 enforces the actual runtime contract: - pkg/rulemanager/cel/libraries/parse/parse.go's getExecPath returns args[0] (the full binary path) - ap.was_executed_with_args(containerID, parse.get_exec_path(event.args, event.comm), event.args) passes the FULL argv to the matcher - Integration test pkg/rulemanager/cel/libraries/applicationprofile/ integration_test.go uses ["/bin/bash", "-c", ...] — full argv - Real fixture tests/resources/known-application-profile.yaml uses args: ["/bin/sleep", "infinity"] with args[0] == path The compare_exec_args_test.go cases that look like flags-only ([-s, ⋯]) are matcher-isolation unit tests, not the wired contract. Removing R-AP-12 would let users author flags-only profiles that silently fail to silence R0040 in production. Bumps storage replace to a7e6234349ab (storage main HEAD post-PR #27). * fix: address CodeRabbit third-review batch on PR #38 (0cf4a50) Three new findings on the May 9 review batch: 1. aplint_test.go R-AP-13 (Major) — embedded * was not flagged, only embedded ⋯. An arg like "foo*bar" linted clean, defeating the purpose of the wildcard-must-be-standalone rule. Added a parallel check for embedded wildcardIdentifier mirroring the existing dynamicIdentifier check. 2. aplint_test.go canonical-fixture skip path (Major) — t.Skipf on missing reference fixture silently disabled the gold-standard test if someone deleted/renamed it. Switched to t.Fatalf with a message that explains the file's role. Also switched the YAML-parse-failure path from Skipf to Fatalf — un-parseable YAML in tests/resources/ is a fixture-quality bug, not something to skip past. Kept Skipf only for the kind!=AP path, which correctly handles the legit non-AP fixtures (Deployments, Services etc.) that share the directory. 3. 
tamper_alert_test.go (Trivial nitpick, 'Low value') — added an integration-style test that exercises the full verifyUserApplicationProfile path: real signature.SignObjectDisableKeyless, tamper the spec, call verifyUserApplicationProfile, assert tamperEmitted carries the key. Then re-sign at a new RV and assert the dedup is cleared. Confirms the wiring 'verifier returns ErrSignatureMismatch → classify as tamper → LoadOrStore in dedup map' actually fires, not just the LoadOrStore call in isolation. Uses the real cosign adapter — no mocking needed.

* fix(test): make dedup-clearing assertion non-trivial

CodeRabbit nitpick on PR #38 (tamper_alert_test.go:159): the prior version of TestVerifyAP_TamperedProfile_PopulatesDedupMap bumped the profile's ResourceVersion to '43' before the re-sign, then asserted that the key for RV='43' was absent from tamperEmitted. But '43' was never added in the first place — only '42' was added during the tamper phase — so the assertion was trivially true. Drop the RV bump: re-sign at the SAME RV='42' so that verifyUserApplicationProfile takes the verify-clean branch and Deletes the existing dedup entry. The assertion now actually exercises the clearing path. Production code unchanged.

* fix(ci): drop 'permissions: read-all' from reusable workflows

Chore PR #39 added 'permissions: read-all' to every workflow file, including the workflow_call reusables. That broke the reusable-call contract: a called workflow's top-level permissions are the FLOOR the caller must grant. Callers of these reusables (pr-created.yaml's pr-created job, pr-merged.yaml's pr-merged job, etc.) only grant a narrow set of scopes (id-token:write, packages:write, security-events:write, pull-requests:write, contents).
They do NOT grant the full read-all set (actions:read, artifact-metadata:read, attestations:read, checks:read, deployments:read, discussions:read, issues:read, models:read, vulnerability-alerts:read, packages:read, pages:read, repository-projects:read, statuses:read, id-token:read). Result: every PR opened on this fork since chore #39 merged (May 6) has ended in pr-created.yaml startup_failure with:

Error calling workflow ... incluster-comp-pr-created.yaml@<sha>. The workflow is requesting 'actions: read, ...', but is only allowed [...]

Fix: strip top-level 'permissions: read-all' from the four reusables:
- benchmark.yaml (also direct-trigger; falls back to per-job perms)
- go-basic-tests.yaml
- incluster-comp-pr-created.yaml
- incluster-comp-pr-merged.yaml

Each reusable's per-job 'permissions:' blocks remain intact and correctly request only what the job needs. Callers' existing grants are sufficient. Top-level 'permissions: read-all' stays on the OUTERMOST workflows (pr-created.yaml, pr-merged.yaml, build.yaml, bypass.yaml, component-tests.yaml, sign-object.yaml, scorecard.yml) — they're event-triggered, not workflow_call'd, so the read-all default still hardens GITHUB_TOKEN there.

---------

Co-authored-by: Entlein <eineintlein@gmail.com>
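The floor rule behind the CI fix can be illustrated with a hypothetical minimal pair (file names match the reusables above, but the job contents are invented for illustration): a reusable with no top-level `permissions:` block only requires what its jobs request, whereas a top-level `permissions: read-all` would force the caller to grant every read scope.

```yaml
# go-basic-tests.yaml (reusable) — NO top-level permissions block
on:
  workflow_call:
jobs:
  unit:
    runs-on: ubuntu-latest
    permissions:
      contents: read          # per-job request; this is all the caller must grant
    steps:
      - uses: actions/checkout@v4
      - run: go test ./...

# Caller side (e.g. pr-created.yaml) — sketch, shown as comments since two
# workflow files cannot share one YAML document:
#
# jobs:
#   tests:
#     uses: ./.github/workflows/go-basic-tests.yaml
#     permissions:
#       contents: read        # satisfies the reusable's per-job request
```

With `permissions: read-all` at the reusable's top level, the same caller would fail at startup exactly as described above, because its narrow grant cannot cover the full read-all set.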
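The tamperEmitted dedup semantics pinned by the test commits above can be sketched with a sync.Map. This is an illustrative reduction, not the repo's code: the key format and the helper names markTamper/clearOnCleanVerify are invented; only the LoadOrStore-once-per-RV and Delete-on-clean-verify behavior comes from the commit messages.

```go
package main

import (
	"fmt"
	"sync"
)

// tamperEmitted dedups R1016 alerts: one alert per profile+RV.
var tamperEmitted sync.Map

// markTamper returns true only the FIRST time a key is seen
// (LoadOrStore reports loaded=true on repeats).
func markTamper(key string) bool {
	_, loaded := tamperEmitted.LoadOrStore(key, struct{}{})
	return !loaded
}

// clearOnCleanVerify drops the entry after a successful verification,
// so a future tamper at the same RV alerts again.
func clearOnCleanVerify(key string) {
	tamperEmitted.Delete(key)
}

func main() {
	key := "default/web/rv=42" // hypothetical namespace/name/RV key
	fmt.Println(markTamper(key)) // true: first sighting, emit alert
	fmt.Println(markTamper(key)) // false: deduped
	clearOnCleanVerify(key)      // profile re-signed at the same RV verifies clean
	fmt.Println(markTamper(key)) // true: alerts again after clearing
}
```

This is exactly why re-signing at the SAME RV='42' (rather than bumping to '43') is the non-trivial assertion: only the same key exercises the Delete path.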
Summary
Merges upstream kubescape/node-agent main into the fork. Upstream PR kubescape#788 (Replace AP and NN cache with CP) replaces the two legacy workload-keyed caches (`applicationprofilecache`, `networkneighborhoodcache`) with a single container-keyed `ContainerProfileCache`. `packages_deleted_test.go` enforces that the legacy packages stay deleted. This merge accepts upstream's intent rather than fighting it: the legacy packages are deleted, the new `containerprofilecache` lands, and the fork's surviving features keep their behavior on the new structure.

Conflict resolutions
- `pkg/objectcache/applicationprofilecache/{*.go,*_test.go,_interface.go}` — deleted in favor of `containerprofilecache`
- `pkg/objectcache/networkneighborhoodcache/*` — deleted
- `pkg/objectcache/applicationprofilecache/callstackcache/` — moved to `pkg/objectcache/callstackcache/` (domain-agnostic)
- `cmd/main.go` — `&cfg` arg on `NewRulesWatcher(...)`
- `go.mod`/`go.sum` — `replace` for `k8sstormcenter/storage`; bump `k8s-interface` v0.0.206 → v0.0.208
- `helpersv1.LearningPeriodMetadataKey` — introduced in upstream kubescape#797

Verified locally
- `go build ./...` clean
- `go test ./pkg/objectcache/... ./pkg/rulemanager/... ./pkg/containerprofilemanager/...` all green

Out of scope — separate follow-ups required
- (fea3b062) `containerprofilecache.CachedContainerProfile.UserNNRV` overlay slot
- `CachedContainerProfile.UserAP` overlay slot
- `CompareExecArgs` wiring (PR #35 / branch `feat/exec-arg-wildcards`)
- what `wasExecutedWithArgs` looks like after the CP-cache rewire of consumers

Test plan
Storage replace pin
`github.com/k8sstormcenter/storage v0.0.240-0.20260429052903-0e0366026f05` — does NOT yet include the storage Internal-field merge fix (commit 352395a3). Bump in a separate follow-up if you want CI to pick up that fix here.
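For reference, the pin above corresponds to a go.mod `replace` directive of roughly this shape. The replaced module path `github.com/kubescape/storage` is an assumption from context (the fork redirects upstream's storage module); only the fork path and version come from the text above.

```
// go.mod fragment (illustrative)
replace github.com/kubescape/storage => github.com/k8sstormcenter/storage v0.0.240-0.20260429052903-0e0366026f05
```

Bumping past the Internal-field merge fix would mean replacing the pseudo-version with one cut at or after commit 352395a3.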